Abstract. "Nishimori line" is a line or hypersurface in the parameter space 
of systems with quenched disorder, where simple expressions of the averages of 
physical quantities over the quenched random variables are obtained. It has been 
playing an important role in the theoretical studies of the random frustrated 
systems since its discovery around 1980. In this paper, a novel interpretation 
of the Nishimori line from the viewpoint of statistical information processing 
is presented. Our main aim is the reconstruction of the whole theory of the 
Nishimori line from the viewpoint of Bayesian statistics, or, almost equivalently, 
from the viewpoint of the theory of error-correcting codes. As a byproduct of 
our interpretation, counterparts of the Nishimori line in models without gauge 
invariance are given. We also discuss issues concerning the "finite temperature 
decoding" of error-correcting codes in connection with our theme and clarify the 
role of gauge invariance in this topic. 

1. Introduction 

There are not many rigorous results that are useful for the study of random frustrated 
systems. Among them, theorems related to the Nishimori line of random spin models 
form an important family. There have been many papers [1, 2, 3, 4, 5] about the 
Nishimori line after the seminal paper [1] of Nishimori. There is, however, still a 
mystery about the Nishimori line, i.e., its physical meaning and the motivation behind 
the proof are not yet clear. 

The purpose of this paper is to provide a novel interpretation of the Nishimori 
line from the viewpoint of statistical information processing, more specifically from 
Bayesian statistics or from coding theory. Our interpretation has two advantages. 
First, it gives an interesting example of an unexpected relation between two different 
areas, rigorous arguments in statistical physics and Bayesian statistics, and 
elucidates the meaning of the trick in the derivation of the Nishimori line. Second, it 
gives some new results on the analog of the Nishimori line in models without gauge 
invariance in the sense of Toulouse [6]. 

Our arguments are closely related to the works on "the optimality of 
finite-temperature decoding" of error-correcting codes [7, 8, 9, 10]. In fact, some of our 
results are essentially given in Sourlas [10]. In these works, however, finite-temperature 
decoding is discussed in terms of spin glass theory. Our aim here, on the other hand, is to 
reverse the direction of the arguments and discuss the whole theory of the Nishimori line 
from the viewpoint of statistical science. We will discuss finite-temperature decoding 
further in the last section. 

In this paper, we make efforts to give a self-contained description of this 
material. No special knowledge of Bayesian statistics, error-correcting codes, or the 
gauge invariance of spin glasses is assumed. 



2. Bayesian Framework 

In this section, we give basic notions and terminology of Bayesian statistics. We also 
discuss identities and inequalities that naturally arise from the Bayesian framework. 
Although the motivation for these formulas as well as their proofs are quite simple, 
they are essential in the derivation of the properties of the Nishimori line. 

Let us assume that our data $y$ is generated by a probability distribution $p(y|x)$, 
which is parameterized by the value of an unknown variable $x$. In the Bayesian 
framework, we also assume that the parameter $x$ is itself a random sample from 
a prior distribution $\pi(x)$. With these assumptions, the probability distribution of the 
parameter $x$ conditioned on given data $y$ is 

$$p(x|y) = \frac{p(y|x)\,\pi(x)}{\sum_x p(y|x)\,\pi(x)}. \tag{1}$$

Here $\sum_x$ means the summation or integral over the possible values of $x$. This 
distribution, the posterior distribution, is the source of knowledge with given data 
$y$ in the Bayesian formalism. 

A similar formalism is also used in a seemingly different branch of information 
science, the theory of error-correcting codes. Consider a noisy channel and a set of 
messages. We encode and send a message $x$ through the noisy channel, and someone 
at the other end of the channel tries to infer the original message $x$ from the output 
$y$. If we are given the probability $p(y|x)$ of an output $y$ with the input $x$ and 
the distribution $\pi(x)$ of the average frequencies of input messages, the conditional 
probability $p(x|y)$ of an input $x$ with the output $y$ is given by (1). Note that the 
probability $p(y|x)$ represents the coding scheme as well as the noise of the channel in 
this formalism. 
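
As a minimal sketch of formula (1), the following Python fragment computes the posterior for a toy channel. The one-bit message, the flip probability 0.2, and the names `posterior`, `prior`, and `likelihood` are illustrative assumptions and do not appear in the original argument.

```python
def posterior(prior, likelihood, y):
    """Return p(x|y) for every x, following formula (1).

    prior:      dict x -> pi(x)
    likelihood: dict (y, x) -> p(y|x)
    """
    joint = {x: likelihood[(y, x)] * prior[x] for x in prior}
    evidence = sum(joint.values())  # sum_x p(y|x) pi(x)
    return {x: joint[x] / evidence for x in joint}


# Hypothetical toy channel: one bit x in {+1, -1}, flipped with probability 0.2.
prior = {+1: 0.5, -1: 0.5}
flip = 0.2
likelihood = {(y, x): (1 - flip if y == x else flip)
              for y in (+1, -1) for x in (+1, -1)}

post = posterior(prior, likelihood, y=+1)
```

With a uniform prior, the posterior probability of the transmitted bit being equal to the received one is simply one minus the flip probability.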

We now introduce notation for averages over different types of 
distributions. Here the symbol $A(x)$ denotes a function of the parameter $x$ and $B(y)$ 
denotes a function of the data $y$. First we define the average over the prior distribution 
of $x$, 

$$[A(x)]_{\pi(x)} = \sum_x A(x)\,\pi(x). \tag{2}$$

We also define the average over the posterior distribution of $x$, 

$$\langle A(x)\rangle_{p(x|y)} = \sum_x A(x)\,p(x|y). \tag{3}$$

Finally we define the average over the probability distribution $p(y|x)$ of data $y$ with 
the given parameter $x$, 

$$[B(y)]_{p(y|x)} = \sum_y B(y)\,p(y|x). \tag{4}$$

This notation is not common in the literature on Bayesian statistics. It is 
introduced to highlight the analogy with the statistical physics of systems with quenched 
disorder. Later we show that the averages $[\ ]$ correspond to the quenched average over 
the configuration of the impurities and $\langle\ \rangle$ corresponds to the thermal average. 

Let us consider relations among these averages. First we note that the identity 

$$\left[\left[\langle A(x')\rangle_{p(x'|y)}\right]_{p(y|x)}\right]_{\pi(x)} = [A(x)]_{\pi(x)} \tag{5}$$

holds. The posterior average $\langle A(x')\rangle_{p(x'|y)}$ can be regarded as an estimate of $A(x)$ 
from the data $y$. It is a random variable dependent on $y$, and the identity (5) 
shows that its average over the possible values of the data $y$ and the original 
parameter $x$ coincides with the prior average $[A(x)]_{\pi(x)}$. The proof of the formula (5) 
is straightforward. When we substitute the definitions of the averages (3), (4), (2) into 
the left-hand side of (5), we obtain the expression 

$$\sum_x \sum_y \frac{\sum_{x'} A(x')\,p(y|x')\,\pi(x')}{\sum_{x''} p(y|x'')\,\pi(x'')}\; p(y|x)\,\pi(x). \tag{6}$$

By changing the order of the summations and renaming a dummy index, we can show that the 
factors $\sum_x p(y|x)\pi(x)$ in the numerator and denominator cancel each other. Using 
$\sum_y p(y|x') = 1$ and $\sum_{x'} A(x')\pi(x') = [A(x)]_{\pi(x)}$, the proof of (5) is completed. 
It is easy to generalize (5) to the identity 

$$\left[\left[\langle C(x', y)\rangle_{p(x'|y)}\right]_{p(y|x)}\right]_{\pi(x)} = \left[[C(x, y)]_{p(y|x)}\right]_{\pi(x)}. \tag{7}$$

Here $C(x,y)$ is a function of the data (the output of the channel) $y$ as well as the 
parameter $x$. The proof of the relation (7) is essentially the same as that of (5). The 
only difference from (5) is that the average $[\ ]_{p(y|x)}$ on the right-hand side cannot be 
removed. 
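
The identities (5) and (7) can be checked numerically on a small discrete model. The sketch below assumes randomly generated distributions over three values of $x$ and four values of $y$; the sizes, the seed, and the test functions `A` and `C` are arbitrary choices made only for illustration.

```python
import random

random.seed(0)
X, Y = range(3), range(4)


def normalize(w):
    s = sum(w)
    return [v / s for v in w]


prior = normalize([random.random() for _ in X])                # pi(x)
lik = {x: normalize([random.random() for _ in Y]) for x in X}  # lik[x][y] = p(y|x)


def posterior(y):
    """p(x|y) for all x, from formula (1)."""
    return normalize([lik[x][y] * prior[x] for x in X])


def lhs(C):
    """[[<C(x', y)>_{p(x'|y)}]_{p(y|x)}]_{pi(x)}"""
    total = 0.0
    for x in X:
        for y in Y:
            post = posterior(y)
            inner = sum(C(xp, y) * post[xp] for xp in X)
            total += prior[x] * lik[x][y] * inner
    return total


def rhs(C):
    """[[C(x, y)]_{p(y|x)}]_{pi(x)}"""
    return sum(prior[x] * lik[x][y] * C(x, y) for x in X for y in Y)


A = lambda x, y: x * x + 1.0          # depends on x only: identity (5)
C = lambda x, y: (x + 1) * (y - 1.5)  # depends on x and y: identity (7)
gap5 = abs(lhs(A) - rhs(A))
gap7 = abs(lhs(C) - rhs(C))
```

Both gaps vanish up to rounding error, whatever distributions are drawn, because the inferring distribution is the same as the generating one.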

In these arguments, we assume that the "true" distributions $p(y|x)$ and $\pi(x)$ 
behind given data are exactly known. They are, however, often unknown in 
real-world examples. A way to fill this gap is to include "hyperparameters" $\alpha$ and $\gamma$ in the 
expressions of $p(y|x)$ and $\pi(x)$ and estimate them from the data. Hereafter we use the 
notation $p_\alpha(y|x)$ and $\pi_\gamma(x)$ to indicate the distributions that contain hyperparameters. 
An approach to estimating the hyperparameters $\alpha$ and $\gamma$ from the data $y$ is the minimization 
of a free-energy-like quantity, 

$$F(\alpha,\gamma) = -\log \sum_x p_\alpha(y|x)\,\pi_\gamma(x). \tag{8}$$

Note that the procedure based on the marginal likelihood $\sum_x p_\alpha(y|x)\pi_\gamma(x)$ has been 
used successfully by many authors in practical problems. It goes by many 
different names, such as the maximization of type II likelihood [11, 12], the minimization 
of ABIC [13], the maximization of evidence [14, 15], and, simply, the maximization of 
the likelihood of $\alpha$ and $\gamma$ [16, 17, 18] ‡. 

For the moment, we assume that the forms of the distributions $\pi_\gamma(x)$ and $p_\alpha(y|x)$ 
are correctly known except for the values of the hyperparameters. Even in this case, 
the hyperparameters $(\alpha, \gamma)$ that minimize (8) are random variables dependent on the 
data $y$ and they fluctuate around the true values $(\alpha_0, \gamma_0)$ of $(\alpha, \gamma)$. However, the $(\alpha, \gamma)$ 
that minimize the average of $F(\alpha,\gamma)$ over the true distributions $\pi_{\gamma_0}(x)$ and $p_{\alpha_0}(y|x)$ 
coincide with the true values $(\alpha_0, \gamma_0)$, i.e., the inequality 

$$\left[[F(\alpha_0,\gamma_0)]_{p_{\alpha_0}(y|x)}\right]_{\pi_{\gamma_0}(x)} \le \left[[F(\alpha,\gamma)]_{p_{\alpha_0}(y|x)}\right]_{\pi_{\gamma_0}(x)} \tag{9}$$

holds for any value of $\alpha$ and $\gamma$. 

‡ When the expression of the probability $p(y|x)$ is considered as a function of $x$ with given data 
$y$, it is called the "likelihood of the parameter $x$". This terminology is preferred by non-Bayesians, 
who do not treat the parameter $x$ as a random variable, but is also used by Bayesians. The mixture 
distribution $\sum_x p_\alpha(y|x)\pi_\gamma(x)$ on the right-hand side of (8) can be regarded as the likelihood of the 
hyperparameters $\alpha$ and $\gamma$. 

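If the right-hand side of (9) is a sufficiently smooth function of $(\alpha, \gamma)$, its 
derivatives at $(\alpha, \gamma) = (\alpha_0, \gamma_0)$ should be zero. For example, the following 
relations are direct consequences of (9): 

$$\left.\left[\left[\frac{\partial}{\partial\alpha}F(\alpha,\gamma)\right]_{p_{\alpha_0}(y|x)}\right]_{\pi_{\gamma_0}(x)}\right|_{(\alpha,\gamma)=(\alpha_0,\gamma_0)} = 0, \tag{10}$$

$$\left.\left[\left[\frac{\partial}{\partial\gamma}F(\alpha,\gamma)\right]_{p_{\alpha_0}(y|x)}\right]_{\pi_{\gamma_0}(x)}\right|_{(\alpha,\gamma)=(\alpha_0,\gamma_0)} = 0. \tag{11}$$

Here, the derivatives $\partial F/\partial\alpha$ and $\partial F/\partial\gamma$ should be interpreted as the derivatives 
$\partial F/\partial\alpha_k$ and $\partial F/\partial\gamma_k$ with respect to each component of $\alpha = \{\alpha_k\}$ and $\gamma = \{\gamma_k\}$ when $\alpha$ and $\gamma$ 
are vectors with more than one component. Conditions on the second derivatives 
are also derived from (9) by using positive semi-definiteness of the Hessian, say, 

$$\left.\left[\left[\frac{\partial^2}{\partial\alpha^2}F(\alpha,\gamma)\right]_{p_{\alpha_0}(y|x)}\right]_{\pi_{\gamma_0}(x)}\right|_{(\alpha,\gamma)=(\alpha_0,\gamma_0)} \ge 0, \tag{12}$$

which ensure that $(\alpha_0, \gamma_0)$ is a relative minimum of the right-hand side of (9). 

A simple way to prove (9) is the use of the Gibbs inequality, 

$$-\sum_z P(z)\log P(z) \le -\sum_z P(z)\log Q(z), \tag{13}$$

where $P(z)$ and $Q(z)$ are arbitrary functions that satisfy the relations 
$0 \le P(z), Q(z) \le 1$ and $\sum_z Q(z) = \sum_z P(z) = 1$. If we set $P(y) = \sum_x p_{\alpha_0}(y|x)\pi_{\gamma_0}(x)$ and 
$Q(y) = \sum_x p_\alpha(y|x)\pi_\gamma(x)$, it is easy to verify the requirement of the Gibbs inequality 
(13). Then it follows that, for any $\alpha$ and $\gamma$, 

$$-\sum_y \Big[\sum_x p_{\alpha_0}(y|x)\pi_{\gamma_0}(x)\Big] \log \Big[\sum_x p_{\alpha_0}(y|x)\pi_{\gamma_0}(x)\Big] \le -\sum_y \Big[\sum_x p_{\alpha_0}(y|x)\pi_{\gamma_0}(x)\Big] \log \Big[\sum_x p_{\alpha}(y|x)\pi_{\gamma}(x)\Big]. \tag{14}$$
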

This proves (9) and its corollaries (10), (11), (12). We can also prove (10), (11), (12) 
through direct calculations similar to that for (5). 
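
The inequality (9) can be illustrated by exact enumeration. The sketch below assumes a deliberately small, hypothetical model that is not discussed in the text: a single bit $x = \pm 1$ with prior $\pi_\gamma(x) \propto \exp(\gamma x)$ is observed three times through a binary symmetric channel with $p_\alpha(y_k|x) \propto \exp(\alpha y_k x)$. The grid is restricted to $\alpha > 0$ because $(\alpha,\gamma)$ and $(-\alpha,-\gamma)$ give the same marginal distribution of the data.

```python
import itertools
import math

YS = list(itertools.product((1, -1), repeat=3))  # three noisy copies of one bit


def prior(x, g):
    return math.exp(g * x) / (2.0 * math.cosh(g))


def channel(y, x, a):
    return math.exp(a * x * sum(y)) / (2.0 * math.cosh(a)) ** len(y)


def marginal(y, a, g):
    return sum(channel(y, x, a) * prior(x, g) for x in (1, -1))


def F(y, a, g):
    """Free-energy-like quantity (8)."""
    return -math.log(marginal(y, a, g))


def mean_F(a, g, a0, g0):
    """Average of F(a, g) over data generated with the true values (a0, g0)."""
    return sum(marginal(y, a0, g0) * F(y, a, g) for y in YS)


a0, g0 = 0.8, 0.4  # assumed "true" hyperparameters
grid = [(i / 10, j / 10) for i in range(2, 13) for j in range(-6, 7)]
best = min(grid, key=lambda h: mean_F(h[0], h[1], a0, g0))
```

On this grid the averaged $F$ is smallest at the true pair, in agreement with (9); for a single realization of $y$ the minimizer of $F$ itself would fluctuate around it.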

The history of Bayesian statistics is long and complicated. It originates from 
Laplace's work on "inverse probability" and has been an archetype of 
mathematical theories of uncertain objects. Despite different interpretations and 
objections to the use of prior distributions, it is an important language in a wide area 
of the sciences of information processing, say, time-series analysis [13, 16, 19], image 
restoration [20, 21, 22, 17, 23, 24, 18, 25], inference with neural networks [15, 26], and 
artificial intelligence. An earlier remark on the analogy between Bayesian statistics 
and statistical mechanics is found, for example, in Iba [27]. Sourlas [28] seems to be the 
first reference that discussed the relation between coding theory and spin glasses. 
We also refer to recent works of Bruce and Saad [29], Pryce and Bruce [30], Tanaka and 
Morita [31], Kabashima and Saad [32, 33], Dress et al. [34], and Opper and Winther [35], 
which treat the relation between statistical mechanics and Bayesian statistics (or 
error-correcting codes). 

3. The Nishimori Line 

Now we discuss the relation between the results in the previous section and the 
Nishimori line of spin glasses. To see this in the simplest case of the $\pm J$ Ising spin glass, 
we set the distributions as follows: 

$$p_\alpha(y|x) = \frac{1}{Z_\alpha}\exp(-E_\alpha(x,y)), \tag{15}$$

$$-E_\alpha(x,y) = \alpha\sum_{(i,j)} y_{ij}x_ix_j, \tag{16}$$

$$Z_\alpha = \sum_y \exp(-E_\alpha(x,y)) = (\exp(\alpha) + \exp(-\alpha))^M \tag{17}$$

and 

$$\pi(x) = \frac{1}{2^N} \quad \text{(the uniform distribution)}, \tag{18}$$

where each component $x_i$ ($i \in \{1, \ldots, N\}$) of the parameter $x$ takes the value 
$\pm 1$. The component $x_i$ is defined on the vertex $i$ of a graph $G$, say a square lattice 
or a random network, with $N$ vertices. The data $y = \{y_{ij}\}$ is defined on the edges $(i,j)$ 
of $G$ and the summation $\sum_{(i,j)}$ runs over them. We denote the number of the edges 
of $G$ as $M$, which is also the number of the data. 

This probability $p_\alpha(y|x)$ corresponds to a binary symmetric channel where a set 
$\{y^{\rm in}_{ij}\}$ ($(i,j) \in G$) of the pair products $y^{\rm in}_{ij} = x_ix_j$ of the inputs is sent as an 
error-correcting code [28, 10, 8, 32]. Here, "binary symmetric" means that the output 
$y_{ij}$ of the channel is given by the formula 

$$y_{ij} = \begin{cases} +y^{\rm in}_{ij} & \text{with probability } 1-q, \\ -y^{\rm in}_{ij} & \text{with probability } q. \end{cases} \tag{19}$$

If we assume that the data $y_{ij}$ is generated by $p_{\alpha_0}(y|x)$ and $\pi(x)$ defined by (15) and 
(18), the noise level $q$ of the channel is related to the hyperparameter $\alpha_0$ by 

$$q = \frac{\exp(-\alpha_0)}{\exp(\alpha_0) + \exp(-\alpha_0)}. \tag{20}$$

Although the "pair product code" $y^{\rm in}_{ij} = x_ix_j$ defined above looks rather 
artificial, recent works [32, 33] on error-correcting codes suggest that its generalization might 
have practical importance §. 
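
The relation (20) and the channel (19) can be sketched in a few lines. The helper names (`flip_probability`, `alpha_from_q`, `encode`, `transmit`), the four-vertex cycle, and the random seed below are assumptions introduced for illustration only.

```python
import math
import random


def flip_probability(alpha0):
    """Relation (20): noise level q of the channel for hyperparameter alpha0."""
    return math.exp(-alpha0) / (math.exp(alpha0) + math.exp(-alpha0))


def alpha_from_q(q):
    """Inverse of (20): alpha0 = (1/2) log((1 - q) / q), valid for 0 < q < 1."""
    return 0.5 * math.log((1.0 - q) / q)


def encode(x, edges):
    """Pair product code: y_in[(i, j)] = x_i * x_j for each edge (i, j)."""
    return {(i, j): x[i] * x[j] for (i, j) in edges}


def transmit(y_in, q, rng):
    """Binary symmetric channel (19): flip each symbol with probability q."""
    return {e: (-v if rng.random() < q else v) for e, v in y_in.items()}


rng = random.Random(1)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # hypothetical 4-cycle
x = [1, -1, -1, 1]
y = transmit(encode(x, edges), flip_probability(1.0), rng)
```

Solving (20) for $\alpha_0$ shows that $\alpha_0 > 0$ corresponds to $q < 1/2$, and $\alpha_0 = 0$ to a channel that carries no information.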

The posterior distribution of the model with data $\{y_{ij}\}$ and hyperparameter $\alpha$ is 

$$p_\alpha(x|y) = \frac{1}{Z_{\rm pos}}\exp(-E_\alpha(x,y)), \tag{21}$$

$$Z_{\rm pos} = \sum_x \exp(-E_\alpha(x,y)). \tag{22}$$

§ Another interesting interpretation of the probability $p_\alpha(y|x)$ is given by a problem [36] that arises 
in the analysis of social network data [37]. With this interpretation, the index $i$ indicates a person and 
the binary variable $y_{ij} \in \{\pm 1\}$ ($y_{ij} = y_{ji}$) indicates a social relation between persons $i$ and $j$, say, 
whether they are acquainted or not. Each person is assumed to belong to one of the social 
groups $A$ and $B$, and the problem is to infer the group structure from the data $\{y_{ij}\}$. We set the 
indicator $x_i = 1$ when $i \in A$ and $x_i = -1$ when $i \in B$ and assume the following property: 

If a pair of persons $i$ and $j$ is in the same social group, $y_{ij} = 1$ with probability $q$ and 
$-1$ with probability $1 - q$. Otherwise, if they are in different groups, $y_{ij} = 1$ with probability 
$q'$ and $-1$ with probability $1 - q'$. 

Then we get the probability $p_\alpha(y|x)$ in the text as a special case where $q' = 1 - q$. 

This is the Gibbs distribution of a random bond Ising model with coupling constants 
$\{y_{ij}\}$ defined on the graph $G$. The derivative of the function $F$ defined by (8) is 

$$\frac{\partial}{\partial\alpha}F(\alpha,\gamma) = \frac{1}{\alpha}\langle E_\alpha(x',y)\rangle_{p_\alpha(x'|y)} + M\cdot\tanh\alpha, \tag{23}$$

where $\langle\ \rangle_{p_\alpha(x|y)}$ indicates the canonical average with the energy $E_\alpha(x,y)$. (Here and 
hereafter, we assume that we are working at unit temperature $T = 1$, and $\alpha$ is 
treated as a (hyper)parameter of the model, not as the temperature.) The term 
$M\cdot\tanh\alpha$ comes from the derivative of the logarithm of the normalization factor 
$(\exp(\alpha) + \exp(-\alpha))^M$ of the probability $p_\alpha(y|x)$. 

In general, a misspecification of the hyperparameter $\alpha$ in (21) and (23) is possible. 
If we assume that we know the "true" value $\alpha_0$ used in the generation of the data 
$\{y_{ij}\}$ and set $\alpha = \alpha_0$, or, equivalently, 

$$\frac{\exp(-\alpha)}{\exp(\alpha) + \exp(-\alpha)} = q \tag{24}$$

in the formulas, we have an identity for the energy, 

$$-\left[\left[\langle E_\alpha(x',y)\rangle_{p_\alpha(x'|y)}\right]_{p_\alpha(y|x)}\right]_{\pi(x)} = \alpha M\cdot\tanh\alpha, \tag{25}$$

from the identity (10). 
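
The identity (25) can be confirmed by exact enumeration on a small graph. The sketch below assumes a hypothetical graph of four vertices (a triangle with one pendant edge). It also evaluates the same average with mismatched hyperparameters, which illustrates that the simple expression is special to the condition $\alpha = \alpha_0$ once the graph contains a loop.

```python
import itertools
import math

EDGES = [(0, 1), (1, 2), (0, 2), (2, 3)]  # hypothetical small graph with one loop
N, M = 4, len(EDGES)
XS = list(itertools.product((1, -1), repeat=N))
YS = list(itertools.product((1, -1), repeat=M))


def minus_energy(x, y, a):
    """-E_a(x, y) of (16)."""
    return a * sum(y[k] * x[i] * x[j] for k, (i, j) in enumerate(EDGES))


def likelihood(y, x, a):
    """p_a(y|x) of (15) with the normalization (17)."""
    return math.exp(minus_energy(x, y, a)) / (2.0 * math.cosh(a)) ** M


def posterior_mean_minus_energy(y, a):
    """<-E_a(x', y)> under the posterior (21)."""
    w = [math.exp(minus_energy(xp, y, a)) for xp in XS]
    return sum(wi * minus_energy(xp, y, a) for wi, xp in zip(w, XS)) / sum(w)


def averaged_minus_energy(a_decode, a_true):
    """-[[<E>]] with data generated at a_true and inference done at a_decode."""
    return sum(likelihood(y, x, a_true) * posterior_mean_minus_energy(y, a_decode)
               for x in XS for y in YS) / len(XS)


a = 0.8
on_line = averaged_minus_energy(a, a)       # alpha = alpha_0
off_line = averaged_minus_energy(a, 0.3)    # alpha != alpha_0
target = a * M * math.tanh(a)
```

Here `on_line` agrees with `target` to rounding error, while `off_line` does not.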

So far, the average over the bond randomness 

$$\left[[\cdots]_{p_\alpha(y|x)}\right]_{\pi(x)} \tag{26}$$

has a rather complicated structure. There are two steps in the generation of 
the quenched random variables $\{y_{ij}\}$, which are described by $\pi(x)$ and $p_\alpha(y|x)$ 
respectively. In this particular case, we can simplify it using the gauge invariance of 
the problem. The gauge transformation group of this model is defined by the family 
of transformations 

$$V_z: \{y_{ij}\} \to \{z_i\cdot y_{ij}\cdot z_j\}, \tag{27}$$

$$U_z: \{x_i\} \to \{x_i\cdot z_i\}, \tag{28}$$

parameterized by $z = \{z_i\}$, $z_i \in \{\pm 1\}$. These transformations are one-to-one 
onto mappings (permutations) of their domains and satisfy a transitive property, 
i.e., for any pair $x$ and $x'$ in the domain there exists $z$ with which $U_z(x) = x'$. It is 
easy to show the following relations: 

$$p_\alpha(U_z(x)|V_z(y)) = p_\alpha(x|y), \tag{29}$$

$$\pi(U_z(x)) = \pi(x), \tag{30}$$

$$p_\alpha(V_z(y)|U_z(x)) = p_\alpha(y|x), \tag{31}$$

$$E_\alpha(U_z(x), V_z(y)) = E_\alpha(x,y). \tag{32}$$

They are an example of gauge invariance (or gauge covariance) in a random 
spin system ([6, 38, 1, 2]; see [5] for a comprehensive treatment with applications 
to the Nishimori line). By these formulas, we can show that the left-hand side of the 
expression (25) is written as a simpler average 

$$\begin{aligned}
&\left[\left[\langle E_\alpha(x',y)\rangle_{p_\alpha(x'|y)}\right]_{p_\alpha(y|x)}\right]_{\pi(x)} \\
&\quad= \sum_x\sum_y\sum_{x'} E_\alpha(x',y)\cdot p_\alpha(x'|y)\cdot p_\alpha(y|x)\cdot\pi(x) \\
&\quad= \sum_x\sum_y\sum_{x'} E_\alpha(U_z(x'),V_z(y))\cdot p_\alpha(U_z(x')|V_z(y))\cdot p_\alpha(V_z(y)|U_z(x))\cdot\pi(x) \\
&\quad= \sum_y\sum_{x'} E_\alpha(x',y)\cdot p_\alpha(x'|y)\cdot p_\alpha(y|x^*) \\
&\quad= \left[\langle E_\alpha(x',y)\rangle_{p_\alpha(x'|y)}\right]_{p_\alpha(y|x^*)},
\end{aligned} \tag{33}$$

where $x^*$ is the ferromagnetic state $\{x^*_i\}$ ($x^*_i = 1$ for all $i$) and $z$ is a function of $x$ that satisfies 
$x^* = U_z(x)$. The existence of such $z$ is secured by the transitive property, and the 
change of the dummy index, say, from $V_z(y)$ to $y$ in the summation $\sum_y$, is justified 
by the one-to-one onto property of the gauge transformations (27), (28). 
Here and hereafter, we denote the average over the distribution 

$$p_\alpha(y|x^*) = \frac{\exp\big(\alpha\sum_{(i,j)} y_{ij}\big)}{(\exp(\alpha) + \exp(-\alpha))^M} \tag{34}$$

as $[\ ]_q$, where the relation between $q$ and $\alpha$ is defined in (24). It is easy to see that the 
components $y_{ij}$ of a sample from the distribution (34) are mutually independent samples 
from the distribution 

$$\Pr(y_{ij}) = (1-q)\cdot\delta(y_{ij} - 1) + q\cdot\delta(y_{ij} + 1). \tag{35}$$

By using (33) and this notation, the formula (25) is reduced to the identity 

$$-\left[\langle E_\alpha(x,y)\rangle_{p_\alpha(x|y)}\right]_q = \alpha M\cdot\tanh\alpha \tag{36}$$

for the average energy of the $\pm J$ spin glass with the coupling constants $\{y_{ij}\}$ drawn from 
the distribution (35). It is nothing but a result reported in the first paper [1] on the 
Nishimori line. 
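
The reduction (33) from the two-step average to the single average $[\ ]_q$ can also be checked by enumeration. The following sketch assumes a hypothetical triangle graph and takes $y_{ij} = +1$ with probability $1-q$, as implied by (19) and (24) for the ferromagnetic input $x^*$.

```python
import itertools
import math

EDGES = [(0, 1), (1, 2), (0, 2)]  # hypothetical triangle graph
N, M = 3, len(EDGES)
XS = list(itertools.product((1, -1), repeat=N))
YS = list(itertools.product((1, -1), repeat=M))


def overlap(x, y):
    return sum(y[k] * x[i] * x[j] for k, (i, j) in enumerate(EDGES))


def posterior_mean_minus_energy(y, a):
    """<-E_a(x, y)>_{p_a(x|y)} by exact enumeration."""
    weights = [math.exp(a * overlap(x, y)) for x in XS]
    return sum(w * a * overlap(x, y) for w, x in zip(weights, XS)) / sum(weights)


def two_step_average(a):
    """Average over x from the uniform prior, then over y from p_a(y|x)."""
    norm = len(XS) * (2.0 * math.cosh(a)) ** M
    return sum(math.exp(a * overlap(x, y)) / norm * posterior_mean_minus_energy(y, a)
               for x in XS for y in YS)


def q_average(a):
    """Average over independent y_ij: +1 with probability 1 - q, -1 with q."""
    q = math.exp(-a) / (math.exp(a) + math.exp(-a))
    total = 0.0
    for y in YS:
        prob = 1.0
        for v in y:
            prob *= (1.0 - q) if v == 1 else q
        total += prob * posterior_mean_minus_energy(y, a)
    return total


a = 0.8
lhs_25 = two_step_average(a)
lhs_36 = q_average(a)
rhs = a * M * math.tanh(a)
```

The two averages coincide, and both equal $\alpha M \tanh\alpha$, so the sum over the input $x$ can indeed be dropped in this gauge invariant model.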

The relation (24) between the (hyper)parameter $\alpha$ in the canonical average and the 
noise level $q$ in the quenched average is essential and is known as the definition of the 
Nishimori line of the model. In our derivation, it arises from the condition $\alpha = \alpha_0$ in 
the formulas (9) and (10). This means that the model $p_\alpha(y|x)$ assumed in the analysis 
of the data coincides with the "true" probability $p_{\alpha_0}(y|x)$ used in the generation of the 
data. In terms of coding theory, it corresponds to the situation where the decoder 
knows exactly the properties of the channel, the coding, and the relative frequencies of 
the possible messages. It is rather surprising that the notion of the Nishimori line ||, 
which was introduced without any background in statistical information processing, has 
such a clear interpretation from our point of view. 

A similar argument with the substitution of the second derivative $\frac{\partial^2}{\partial\alpha^2}F(\alpha,\gamma)$ of 
$F$ into the expression (12) leads to the inequality 

$$\left[\left[\langle E_\alpha^2(x',y)\rangle_{p_\alpha(x'|y)} - \langle E_\alpha(x',y)\rangle^2_{p_\alpha(x'|y)}\right]_{p_\alpha(y|x)}\right]_{\pi(x)} \le \frac{\alpha^2 M}{\cosh^2\alpha}. \tag{37}$$

|| In general cases, where more than one hyperparameter (or component of a hyperparameter) is contained in the model, 
it is actually a "Nishimori hypersurface". The term Nishimori temperature is also used by statistical 
physicists. It seems, however, not to be adequate terminology in the context of information processing, 
because the notion of temperature has no specific meaning in problems of statistics and 
coding theory. 



The Nishimori line and Bayesian Statistics 






With the gauge invariance of the model, we can derive the inequality 

$$\left[\langle E_\alpha(x,y)^2\rangle_{p_\alpha(x|y)} - \langle E_\alpha(x,y)\rangle^2_{p_\alpha(x|y)}\right]_q \le \frac{\alpha^2 M}{\cosh^2\alpha} \tag{38}$$

from (37). This is an inequality for the fluctuation of the energy (the specific heat) 
on the Nishimori line, which is also discussed in [1]. Some of the other relations that 
hold on the Nishimori line of the model are derived from the identity (7) and the gauge 
invariance of the model. For example, the distribution of the internal field at the 
vertex $i_0$ [3] is reproduced when we set $C(x,y) = \sum_j y_{i_0 j}x_j$, where $j$ runs over the 
set of vertices neighboring $i_0$, i.e., $(i_0, j) \in G$. The expression of the gauge invariant 
correlation function [1] is also derived when we set $C(x,y) = x_k\cdot\big(\prod_{(i,j)\in\Gamma} y_{ij}\big)\cdot x_l$, 
where $\Gamma$ denotes a path that connects the vertices $k$ and $l$. 
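
As an illustration of the inequality (38), the sketch below evaluates both sides by enumeration for several values of $\alpha$. The triangle graph and the list of $\alpha$ values are assumptions chosen only to keep the enumeration small.

```python
import itertools
import math

EDGES = [(0, 1), (1, 2), (0, 2)]  # hypothetical triangle graph
N, M = 3, len(EDGES)
XS = list(itertools.product((1, -1), repeat=N))
YS = list(itertools.product((1, -1), repeat=M))


def energy(x, y, a):
    """E_a(x, y) of (16)."""
    return -a * sum(y[k] * x[i] * x[j] for k, (i, j) in enumerate(EDGES))


def energy_variance(y, a):
    """<E^2> - <E>^2 under the posterior p_a(x|y)."""
    w = [math.exp(-energy(x, y, a)) for x in XS]
    z = sum(w)
    e1 = sum(wi * energy(x, y, a) for wi, x in zip(w, XS)) / z
    e2 = sum(wi * energy(x, y, a) ** 2 for wi, x in zip(w, XS)) / z
    return e2 - e1 * e1


def averaged_variance(a):
    """Left-hand side of (38): the [ ]_q average of the energy variance."""
    q = math.exp(-a) / (math.exp(a) + math.exp(-a))
    total = 0.0
    for y in YS:
        prob = 1.0
        for v in y:
            prob *= (1.0 - q) if v == 1 else q
        total += prob * energy_variance(y, a)
    return total


def bound(a):
    """Right-hand side of (38)."""
    return a * a * M / math.cosh(a) ** 2


alphas = [0.2, 0.5, 1.0, 2.0]
checks = [(averaged_variance(a), bound(a)) for a in alphas]
```

For every value in `alphas` the averaged variance stays below the bound.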

So far we have discussed a statistical model defined by (15) and (18), which leads to 
the Nishimori line of the $\pm J$ spin glass model. Our argument is, however, general 
and can be applied to the Nishimori line of other models, say, spin glasses with 
a Gaussian distribution of the couplings [2], models with multiple spin interactions 
[28, 8, 5, 34, 32, 33], and the gauge glasses [5, 9]. For each model, we can consider 
the corresponding statistical model (or a noisy channel) and derive the properties of 
the Nishimori line from the relations (10), (12), (7) with additional arguments on 
the gauge invariance. For example, we consider the following problem of statistical 
inference: 

There is a collection of phase variables $\{x_i\}$ ($x_i \in [0, 2\pi)$) on the vertices 
of a graph $G$, say, a two-dimensional lattice. The numbers of the vertices 
and edges in $G$ are $N$ and $M$, respectively. We assume that the difference 
$y^{\rm in}_{ij} = x_i - x_j$ of the parameters $x_i$ and $x_j$ is observed for each edge $(i,j)$ in 
$G$ with a probabilistic error $\eta_{ij}$. The problem is to infer the original values 
of $\{x_i\}$ from the data $\{y_{ij}\}$ with the assumption that $y_{ij} = y^{\rm in}_{ij} + \eta_{ij}$. 

Such a problem might have practical importance in the analysis of data from 
optical measurements where differences of the phase between neighboring points are 
observed with noise. If we assume that the noise terms $\{\eta_{ij}\}$ are mutually 
independent and obey a von Mises distribution, the counterpart of a Gaussian 
distribution on a circle, the probability is 

$$p_\alpha(y|x) = \frac{1}{Z_\alpha}\exp(-E_\alpha(x,y)), \tag{39}$$

$$-E_\alpha(x,y) = \alpha\sum_{(i,j)}\cos(x_i - x_j - y_{ij}), \tag{40}$$

$$Z_\alpha = \int\prod_{(i,j)} dy_{ij}\,\exp(-E_\alpha(x,y)) = (2\pi I_0(\alpha))^M, \tag{41}$$

where $I_0$ indicates the modified Bessel function. We also assume the uniform prior 

$$\pi(x)\,dx = \Big(\frac{1}{2\pi}\Big)^N\cdot\prod_i dx_i, \qquad x_i \in [0, 2\pi). \tag{42}$$

From this setting, we can derive the results on the Nishimori line of the gauge glass 
[5, 9] with a method that is similar to that for the $\pm J$ spin glass. 
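
The normalization of the von Mises channel factorizes over the edges, so it is enough to check the single-edge integral $\int_0^{2\pi}\exp(\alpha\cos t)\,dt = 2\pi I_0(\alpha)$. The sketch below does this with a midpoint rule and a power series for $I_0$; the function names, the value $\alpha = 1.5$, and the numerical parameters are illustrative assumptions.

```python
import math


def bessel_i0(a, terms=40):
    """Modified Bessel function I_0(a) from its power series."""
    return sum((a / 2.0) ** (2 * k) / math.factorial(k) ** 2 for k in range(terms))


def edge_normalizer(a, steps=2000):
    """Midpoint-rule estimate of the integral of exp(a cos t) over [0, 2 pi)."""
    h = 2.0 * math.pi / steps
    return h * sum(math.exp(a * math.cos((k + 0.5) * h)) for k in range(steps))


alpha = 1.5
numeric = edge_normalizer(alpha)
closed_form = 2.0 * math.pi * bessel_i0(alpha)
```

Because the shift $x_i - x_j$ only translates the periodic integrand, the result does not depend on $x$, and the factor for $M$ edges is the $M$-th power of the single-edge value.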
4. What is New? 

Now, careful readers may ask what is really new in our approach. Once the expression 

$$\frac{1}{2^N(\exp(\alpha)+\exp(-\alpha))^M}\sum_x\sum_y \frac{\sum_{x'}\big(\alpha\sum_{(i,j)} y_{ij}x'_ix'_j\big)\cdot\exp\big(\alpha\sum_{(i,j)} y_{ij}x'_ix'_j\big)}{\sum_{x'}\exp\big(\alpha\sum_{(i,j)} y_{ij}x'_ix'_j\big)}\,\exp\Big(\alpha\sum_{(i,j)} y_{ij}x_ix_j\Big) \tag{43}$$

of the left-hand side of (36) is derived by the gauge invariance from that of (25), it 
is not difficult to show the relation (36) by direct inspection of the expression. If we 
combine these steps of the proof in this order, the result is nothing but a conventional proof 
[2] of the property of the Nishimori line. The same is true for the derivation with the 
identity (7). In this sense, our argument is not a re-derivation of the Nishimori line 
but a reformulation or re-interpretation of the original derivation. 

There are, however, two major advantages of our approach. First, our 
interpretation elucidates the meaning of the Nishimori line. It is a line on which 
we make inference (or decoding) using the "true" probability structure that generates 
the data (or codes). This coincidence of the encoding and decoding schemes gives 
drastic simplifications of the averages of various kinds of physical quantities. This 
interpretation also explains why there are two different variables $x$ and $x'$ of the 
same type in the expression (43). The variable $x$ corresponds to the original value 
of the parameter (the input message) and $x'$ represents an inference on it. The 
expression (43), in itself, has a clear meaning as the relation (25) between averages. On 
the other hand, in the conventional derivation [2] of the Nishimori line, the insertion 
of the variable $x$ into the left-hand side of (36) looks like a rather artificial procedure, and 
the expression (43), which is defined in the configuration space enlarged by a gauge 
transformation, seems to have no definite meaning. This lack of interpretation 
is a reason why the derivation of the Nishimori line looks somewhat mysterious, even 
though the manipulation of the formulas required in the proof is quite simple and 
elegant. We believe that our interpretation will help to make this point clear. 
Another interesting point of our argument is that it gives a novel interpretation to 
the identity (36) for the energy and the inequality (38). They are essentially necessary 
conditions for the average of the logarithm of the marginal likelihood to take its maximum 
value on the Nishimori line. 

The second advantage of our approach is that it suggests the existence of 
counterparts of the Nishimori line in models without gauge invariance. The 
relations (10), (12), and (7), which are used in the derivation of the Nishimori line, are 
obtained without the gauge invariance of the models. Thus, we can prove the identity 
for the energy, the inequality for the specific heat, the expression of the distribution of 
internal fields, etc., which hold on the "Nishimori line" of models without gauge 
invariance. For example, we consider a Bayesian model 

$$p_\alpha(y|x) = \frac{1}{Z_\alpha}\exp(-E_\alpha(x,y)), \tag{44}$$

$$-E_\alpha(x,y) = \alpha\sum_i y_ix_i, \tag{45}$$

$$Z_\alpha = \sum_y \exp(-E_\alpha(x,y)) = (\exp(\alpha)+\exp(-\alpha))^N \tag{46}$$

and 

$$\pi_\gamma(x) = \frac{1}{Z_\gamma}\exp(-E_\gamma(x)), \tag{47}$$

$$-E_\gamma(x) = \gamma\sum_{(i,j)} x_ix_j, \tag{48}$$

$$Z_\gamma = \sum_x \exp(-E_\gamma(x)). \tag{49}$$


In this case, we assume that the unknown parameters $\{x_i\}$ and the data $\{y_i\}$ are 
defined on the vertices of a graph $G$ with $N$ vertices. When $G$ is a two- or 
three-dimensional lattice, this model corresponds to an image restoration problem with 
prior knowledge on images that is well described by the Ising prior (47). (For image 
restoration with Ising and Potts priors, see the references [20, 21, 22, 24, 25, 30, 31].) 
The posterior distribution of the model is 

$$p_{\alpha\gamma}(x|y) = \frac{1}{Z_{\alpha\gamma}}\exp(-E_{\alpha\gamma}(x,y)), \tag{50}$$

$$-E_{\alpha\gamma}(x,y) = \alpha\sum_i y_ix_i + \gamma\sum_{(i,j)} x_ix_j, \tag{51}$$

$$Z_{\alpha\gamma} = \sum_x \exp(-E_{\alpha\gamma}(x,y)). \tag{52}$$

This is the Gibbs distribution of an Ising model with an inhomogeneous external field 
$\{\alpha\cdot y_i\}$. 

Let us consider the case where the data generation mechanism is exactly 
described by the probabilities (44) with $\alpha = \alpha_0$ and (47) with $\gamma = \gamma_0$, i.e., the 
pattern of the random field $\{y_i\}$ is given by the following process: (i) Generate a 
sample pattern $\{y^{\rm in}_i\}$ from the Gibbs distribution of the Ising model (47) with the 
coupling constant $\gamma_0$. (ii) Flip each component $y^{\rm in}_i$ with the probability 

$$q = \frac{\exp(-\alpha_0)}{\exp(\alpha_0) + \exp(-\alpha_0)} \tag{53}$$

(a binary symmetric channel). Then, we can interpret the posterior (50) as the Gibbs 
distribution of a Random Field Ising Model (RFIM). Note that the external fields on the 
sites of this model are not mutually independent random variables, but are correlated 
in the way specified by (i) and (ii). The "Nishimori line" of this model is defined as 
a surface where the parameters $(\alpha_0, \gamma_0)$ in the definition of the quenched randomness 
coincide with $(\alpha, \gamma)$ in the canonical average. Equivalently, 

$$\frac{\exp(-\alpha)}{\exp(\alpha) + \exp(-\alpha)} = q, \tag{54}$$

$$\gamma = \gamma_0, \tag{55}$$

where $\alpha$ and $\gamma$ are the parameters in (51), and $q$ and $\gamma_0$ are the parameters in (i) and 
(ii). From (10), (11), and (5) with $A(x) = x_ix_j$, we can prove identities that hold on 
the Nishimori line of this model, 

$$\Big[\Big[\Big\langle\sum_i y_ix_i\Big\rangle_{p_{\alpha\gamma}(x|y)}\Big]_{p_\alpha(y|x)}\Big]_{\pi_\gamma(x)} = N\cdot\tanh\alpha, \tag{56}$$

$$\Big[\Big[\Big\langle\sum_{(i,j)} x_ix_j\Big\rangle_{p_{\alpha\gamma}(x|y)}\Big]_{p_\alpha(y|x)}\Big]_{\pi_\gamma(x)} = \Big\langle\sum_{(i,j)} x_ix_j\Big\rangle^{\rm pure}_\gamma, \tag{57}$$

$$\left[\left[\langle x_ix_j\rangle_{p_{\alpha\gamma}(x|y)}\right]_{p_\alpha(y|x)}\right]_{\pi_\gamma(x)} = \langle x_ix_j\rangle^{\rm pure}_\gamma. \tag{58}$$

Here and hereafter, the average $\langle\cdots\rangle^{\rm pure}_\gamma$ is the canonical average with the "pure" 
Ising model with homogeneous couplings of strength $\gamma$ on the same graph $G$. 

The expression (58) shows that the quenched average of the correlation of spins in 
the RFIM with a correlated random field is just the same as that of the corresponding 
pure Ising model. Furthermore, an identity between the order parameters is obtained 
if we consider a set of systems with a fixed boundary condition under which $x_i = 1$ for 
all the spins at the boundary [2]. That is, the random field $\{y_i\}$ is assumed to be 
generated by the process (i) and (ii) in which the values of the boundary spins are 
kept at 1, and the thermal averages $\langle x_ix_j\rangle_{p_{\alpha\gamma}(x|y)}$ and $\langle x_ix_j\rangle^{\rm pure}_\gamma$ are also taken with 
this boundary condition. Assuming that the site $j$ belongs to the boundary and the 
site $i$ is located far from the boundary, the relation 

$$\left[\left[\langle x_i\rangle_{p_{\alpha\gamma}(x|y)}\right]_{p_\alpha(y|x)}\right]_{\pi_\gamma(x)} = m^{\rm pure}_\gamma \tag{59}$$

is derived from (58), where $m^{\rm pure}_\gamma$ is the bulk magnetization per spin of the 
corresponding pure system. 
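
The identities (56) and (58) do not rely on gauge invariance, and they can be verified directly by enumeration. The sketch below assumes a hypothetical ring of four sites with free choice of the hyperparameters; the function names and the values $\alpha = 0.6$, $\gamma = 0.4$ are illustrative only.

```python
import itertools
import math

EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]  # hypothetical 4-site ring
N = 4
CONFIGS = list(itertools.product((1, -1), repeat=N))


def coupling_sum(x):
    return sum(x[i] * x[j] for i, j in EDGES)


def field_sum(x, y):
    return sum(yi * xi for yi, xi in zip(y, x))


def prior(x, g):
    """Ising prior (47) with coupling g."""
    z = sum(math.exp(g * coupling_sum(c)) for c in CONFIGS)
    return math.exp(g * coupling_sum(x)) / z


def channel(y, x, a):
    """Binary symmetric channel (44) acting independently on each site."""
    return math.exp(a * field_sum(x, y)) / (2.0 * math.cosh(a)) ** N


def posterior_average(f, y, a, g):
    """<f(x, y)> under the posterior (50)."""
    w = [math.exp(a * field_sum(x, y) + g * coupling_sum(x)) for x in CONFIGS]
    return sum(wi * f(x, y) for wi, x in zip(w, CONFIGS)) / sum(w)


def quenched_average(f, a, g):
    """[[<f>]] with the same (a, g) used for generation and for inference."""
    post = {y: posterior_average(f, y, a, g) for y in CONFIGS}
    return sum(prior(x, g) * channel(y, x, a) * post[y]
               for x in CONFIGS for y in CONFIGS)


a, g = 0.6, 0.4
lhs_56 = quenched_average(field_sum, a, g)
rhs_56 = N * math.tanh(a)
lhs_58 = quenched_average(lambda x, y: x[0] * x[2], a, g)
rhs_58 = sum(prior(x, g) * x[0] * x[2] for x in CONFIGS)  # pure Ising average
```

Both pairs agree to rounding error, although the random field here is generated by the correlated two-stage process (i) and (ii) rather than by independent draws.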

Although these results depend on the special features of the model, similar 
arguments are applicable to other models without gauge invariance and lead to 
identities and inequalities on the Nishimori line of the model. An example is provided 
by the posterior distribution corresponding to a binary asymmetric channel, which is 
already discussed in Sourlas [10] in the context of optimal decoding. It is related to 
models with a special type of site/bond randomness. 

There are, however, some intrinsic limitations on the utility of the notion of the 
Nishimori line without gauge invariance. First, we cannot simplify the definition of the 
quenched average $[[\cdots]_{p_\alpha(y|x)}]_{\pi_\gamma(x)}$ on the Nishimori line without gauge invariance. 
The results therefore usually contain a complicated quenched average, which often lacks 
a clear correspondence to that in physical systems. The two-stage process (i) and 
(ii) of the generation of quenched randomness in the RFIM described above is a 
typical example of this. Another important remark is that not all of the arguments 
on the Nishimori line with gauge invariance are applicable to models without gauge 
invariance. For example, the identity 

$$\left[\langle x'_i\rangle_{p_\alpha(x'|y)}\cdot\langle x'_i\rangle_{p_{\alpha'}(x'|y)}\right]_q = \left[\langle x'_i\rangle_{p_{\alpha'}(x'|y)}\right]_q \tag{60}$$

valid for the $\pm J$ spin glass model [2] has no counterpart in a model without gauge 
invariance. Here $\alpha$ and $q$ are related by the condition (24) of the Nishimori line and $\alpha'$ 
takes an arbitrary value. The identity (60) is important because the upper bound 

$$\left|\left[\langle x'_i\rangle_{p_{\alpha'}(x'|y)}\right]_q\right| \le \left[\left|\langle x_i\rangle_{p_\alpha(x|y)}\right|\right]_q \tag{61}$$

on the order parameter is derived from it. If we substitute $C(x,y) = x_i\cdot\langle x'_i\rangle_{p_{\alpha'}(x'|y)}$ 
in (7), we can prove the relation 

$$\left[\left[\langle x'_i\rangle_{p_\alpha(x'|y)}\cdot\langle x'_i\rangle_{p_{\alpha'}(x'|y)}\right]_{p_\alpha(y|x)}\right]_{\pi_\gamma(x)} = \left[\left[x_i\cdot\langle x'_i\rangle_{p_{\alpha'}(x'|y)}\right]_{p_\alpha(y|x)}\right]_{\pi_\gamma(x)}, \tag{62}$$

which apparently corresponds to (60). However, further simplification of the right-hand 
side is not possible without gauge invariance. Unfortunately, the expression (62) 
gives little information on the shape of the boundaries in the phase diagram and seems 
not so useful as (60). 

5. Finite-temperature Decoding 

The notion of the optimality of "finite temperature decoding" was introduced to the 
community of statistical physicists by Rujan [7] and discussed by Nishimori [8, 9] and 
Sourlas [10]. Recently, it has again drawn the attention of researchers in this area, 
because the development of the statistical mechanics of error-correcting codes [32] 
enables quantitative arguments on the problem with analytical methods. 

Roughly speaking, "the optimality of finite temperature decoding" means that the 
estimator that maximizes the posterior probability (the Maximum A Posteriori, or MAP, estimator) 
is not always the best estimator. The best estimator depends on the purpose of 
inference (or decoding) and is often defined with averages over the posterior distribution. 
If we call the MAP estimator, which is defined as a "ground state" of the corresponding 
physical system, the "zero-temperature decoder", it is natural to call an estimator 
defined with the posterior averages a "finite temperature decoder" or "T = 1 decoder" 
[10]. 
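
A two-bit example makes the distinction concrete. The posterior table below is a hypothetical one chosen so that the MAP estimate and the estimate built from the posterior marginals of each bit (the sign of the posterior mean) differ; it is not derived from any model in the text.

```python
# Hypothetical posterior p(x|y) over two bits x = (x_1, x_2), x_i = +1 or -1.
posterior = {(1, 1): 0.4, (-1, 1): 0.3, (-1, -1): 0.3, (1, -1): 0.0}


def map_estimate(post):
    """'Zero-temperature decoder': the single most probable configuration."""
    return max(post, key=post.get)


def marginal_estimate(post):
    """'T = 1 decoder': sign of the posterior mean of each bit."""
    n = len(next(iter(post)))
    means = [sum(p * x[i] for x, p in post.items()) for i in range(n)]
    return tuple(1 if m >= 0 else -1 for m in means)


def expected_bit_errors(est, post):
    """Posterior expectation of the number of wrong bits."""
    return sum(p * sum(a != b for a, b in zip(est, x)) for x, p in post.items())


def expected_block_error(est, post):
    """Posterior probability that the whole configuration is wrong."""
    return sum(p for x, p in post.items() if x != est)


x_map = map_estimate(posterior)
x_mpm = marginal_estimate(posterior)
```

In this example the bitwise estimate has the smaller expected number of wrong bits, while the MAP estimate has the smaller probability of getting the whole configuration wrong, so which one is "best" depends on how errors are counted.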

This fact, however, had been well known in statistics and pattern recognition.
For example, Marroquin [39] (see also [24, 25]) discussed an estimator (the
"MPM estimator") in image restoration problems, which is just the same as the one
proposed by Rujan [7]. Moreover, it is not the first work that uses the estimator in
this field¶. General arguments on the optimality of estimators in the Bayesian
framework are already found in textbooks of statistics [40, 41, 42, 12]. The branch
of statistics that discusses optimal decisions with uncertain information is known as
statistical decision theory.

Here, we will briefly discuss the basic results on optimal estimators. Our
treatment is not very different from the arguments in Sourlas [10] and those in
textbooks of statistics. It is, however, useful to give a coherent derivation in the
notation of the earlier sections, because no comprehensive treatment of this subject
seems to be available in the physics literature.

To give a formal definition of optimal estimators, we introduce the notion of a loss
function $L(x,\hat{x})$ that gives a measure of distance+ between the original parameter $x$
and an estimate $\hat{x}$ of $x$. Then we define an optimal estimator $\hat{x}(y)$ for a loss function
$L$ as a function of $y$ that minimizes the expected loss

\[ [[L(x,\hat{x}(y))]_{p(y|x)}]_{\pi(x)}. \tag{63} \]

Here and hereafter we assume that we know the data generation process exactly
and omit the subscripts that indicate the hyperparameters $\alpha$, $\gamma$ in expressions such as
$p(y|x)$, $\pi(x)$ and $\langle\,\cdot\,\rangle$ (i.e., we set the hyperparameters to their "true"
values). Note that the optimality of an estimator defined here is a very strong notion.
It means that $\hat{x}(y)$ has an average performance better than or equal to that of any
other function of the data $y$, provided that the data generation scheme (or the set of
the channel and the frequencies of the messages) is exactly described by the given
probabilities $p(y|x)$ and $\pi(x)$. It is not restricted to optimality within a family of
estimators defined with different hyperparameters or temperatures.

A basic result on optimal estimators is as follows:

Lemma

An optimal estimator $\hat{x}(y)$ for a loss function $L$ is an estimator that minimizes
the posterior average $\langle L(x,\hat{x}(y)) \rangle_{p(x|y)}$ for each $y$.

This result is rather obvious given the principle that the posterior distribution is
the source of all the information from the data. If we use the identity (7), the formal

¶ See, for example, the references [23] and Sec. 2.4 of [21]. A recent work on the optimal estimator
in image restoration is found in Rue [43].

+ It need not satisfy the axioms of a distance.
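Table 1. Loss Functions and Corresponding Optimal Estimators.
Unless otherwise noted in the table, a component $x_i$ of the parameters
and its estimate $\hat{x}_i$ are assumed to take their values in a subset of $\mathbb{R}$ and in $\mathbb{R}$,
respectively. The symbol $\langle\,\cdot\,\rangle$ indicates the average over the posterior distribution.
An expression $\mathrm{argmax}_z f(z)$ indicates a value of $z$ that maximizes $f(z)$, and the
Kronecker delta $\delta_{w,z}$ is defined as usual, i.e., $\delta_{w,z} = 1$ if $w = z$, else $\delta_{w,z} = 0$.

loss function (L)                                                  optimal estimator ($\hat{x}$)                                   comments
-----------------------------------------------------------------------------------------------------------------------------------------------------
$\sum_i (x_i - \hat{x}_i)^2$                                       $\hat{x}_i = \langle x_i \rangle$
$\sum_i |x_i - \hat{x}_i|$                                         $\hat{x}_i$ = median of $p_i(x_i)$ (a)
$1 - \prod_i \delta_{x_i,\hat{x}_i}$                               $\{\hat{x}_i\} = \mathrm{argmax}_x\, p(x|y)$ (b)                $x_i$: a discrete variable
$\sum_i (1 - \delta_{x_i,\hat{x}_i})$                              $\hat{x}_i = \mathrm{argmax}_{x_i}\, p_i(x_i)$ (a)              $x_i$: a discrete variable
$-\sum_i x_i \hat{x}_i$                                            $\hat{x}_i = \langle x_i \rangle / |\langle x_i \rangle|$       $x_i, \hat{x}_i \in \{\pm 1\}$
$-\sum_i \{x_i \log \hat{x}_i + (1 - x_i)\log(1 - \hat{x}_i)\}$    $\hat{x}_i = \langle x_i \rangle$                               $0 \le x_i, \hat{x}_i \le 1$

(a) Here $p_i(x_i)$ indicates the marginal distribution $p_i(x_i) = \sum_{\{x_j\}, j \ne i} p(x|y)$ of $x_i$,
where $\sum_{\{x_j\}, j \ne i}$ means the summation over $x$ with a fixed value of the $i$th component.
(b) It is often called the "MAP (Maximum A Posteriori) estimator".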



derivation of the lemma is easy. When we set $C(x,y) = L(x,\hat{x}(y))$, the expression (7)
gives

\[ [[\langle L(x',\hat{x}(y)) \rangle_{p(x'|y)}]_{p(y|x)}]_{\pi(x)} = [[L(x,\hat{x}(y))]_{p(y|x)}]_{\pi(x)}. \tag{64} \]

Here we note three observations on the formula (64): (a) The estimator $\hat{x}(y)$ is
an arbitrary function of $y$, so we can freely assign its value at each $y$. (b) The
average $[[\cdots]_{p(y|x)}]_{\pi(x)}$ on the left-hand side of (64) is an average over $y$ with
non-negative weights. (c) The function $L(x',\hat{x}(y))$ does not explicitly contain the original
parameter $x$. By using (a), (b) and (c), we can see that the minimizer of the left-hand
side of (64), which equals the expected loss (63), is the minimizer of the posterior
average $\langle L(x,\hat{x}(y)) \rangle_{p(x|y)}$ for each $y$.
Thus, the lemma is proved.
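
As an illustration that is not part of the original argument, the lemma can be checked by brute force on a very small model. The Python sketch below assumes a hypothetical setting in which both the parameter x and the data y take three values; the prior, the channel and the loss matrix are arbitrary numbers chosen only for the example. It compares the estimator built by minimizing the posterior average for each y with the best of all 27 possible functions of y.

```python
import itertools

# Hypothetical toy model: x and y each take values in {0, 1, 2}.
prior = [0.5, 0.3, 0.2]                      # pi(x)
channel = [[0.7, 0.2, 0.1],                  # p(y|x), rows indexed by x
           [0.1, 0.6, 0.3],
           [0.2, 0.2, 0.6]]
loss = [[0.0, 1.0, 4.0],                     # L(x, xhat), an arbitrary asymmetric loss
        [2.0, 0.0, 1.0],
        [3.0, 0.5, 0.0]]
states = range(3)

def expected_loss(estimator):
    """Expected loss (63): average of L(x, xhat(y)) over p(y|x) and pi(x)."""
    return sum(prior[x] * channel[x][y] * loss[x][estimator[y]]
               for x in states for y in states)

def posterior_loss(y, guess):
    """Posterior average of L(x, guess) given y, up to the positive factor p(y)."""
    return sum(prior[x] * channel[x][y] * loss[x][guess] for x in states)

# Estimator of the lemma: minimize the posterior average separately for each y.
pointwise = tuple(min(states, key=lambda g: posterior_loss(y, g)) for y in states)

# Exhaustive search over all 3**3 = 27 functions y -> xhat(y).
best = min(itertools.product(states, repeat=3), key=expected_loss)

print(pointwise, expected_loss(pointwise))
print(best, expected_loss(best))
```

The two printed expected losses agree, as the lemma requires. The posterior is left unnormalized because dividing by p(y) does not change the minimizer at a given y.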

For example, consider the case where the distance $L$ between the binary sequences
$x = \{x_i\}$ and $\hat{x} = \{\hat{x}_i\}$ ($x_i, \hat{x}_i \in \{\pm 1\}$) is measured by the overlap $\sum_i x_i \hat{x}_i$ of the
patterns, i.e., $L(x,\hat{x}) = -\sum_i x_i \hat{x}_i$. With this loss function,

\[ \langle L(x,\hat{x}) \rangle_{p(x|y)} = -\sum_i \hat{x}_i(y)\, \langle x_i \rangle_{p(x|y)}. \tag{65} \]

Then, the optimal estimator $\hat{x}_i(y) \in \{\pm 1\}$, which minimizes the right-hand side of
(65), is given by

\[ \hat{x}_i(y) = \frac{\langle x_i \rangle_{p(x|y)}}{|\langle x_i \rangle_{p(x|y)}|}. \tag{66} \]

This expression coincides with the result in [7, 8, 10, 39, 24, 25]. Examples of loss
functions and the corresponding optimal estimators are shown in Table 1. By
using the lemma, we can easily derive them.
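
To make the difference between the estimator (66) and the MAP estimator concrete, the following Python sketch enumerates a tiny example. Everything specific in it is an assumption made for illustration: three binary symbols, a ferromagnetic chain prior with coupling `BETA`, and a binary symmetric channel with flip probability `Q`. It computes the expected overlap between the original x and each estimate by exact summation over all x and y.

```python
import itertools
import math

N = 3          # number of spins (kept tiny so that full enumeration is possible)
BETA = 0.5     # assumed coupling of a ferromagnetic chain prior
Q = 0.3        # assumed flip probability of a binary symmetric channel

configs = list(itertools.product([-1, 1], repeat=N))

def prior_weight(x):
    return math.exp(BETA * sum(x[i] * x[i + 1] for i in range(N - 1)))

Z = sum(prior_weight(x) for x in configs)

def joint(x, y):
    """pi(x) p(y|x) for independent symbol flips."""
    flips = sum(xi != yi for xi, yi in zip(x, y))
    return prior_weight(x) / Z * Q**flips * (1 - Q)**(N - flips)

def map_estimate(y):
    """Zero-temperature decoder: the most probable configuration given y."""
    return max(configs, key=lambda x: joint(x, y))

def mpm_estimate(y):
    """Estimator (66): the sign of each posterior mean (ties broken as +1)."""
    means = [sum(x[i] * joint(x, y) for x in configs) for i in range(N)]
    return tuple(1 if m >= 0 else -1 for m in means)

def expected_overlap(estimator):
    return sum(joint(x, y) * sum(a * b for a, b in zip(x, estimator(y)))
               for x in configs for y in configs)

overlap_map = expected_overlap(map_estimate)
overlap_mpm = expected_overlap(mpm_estimate)
print(overlap_map, overlap_mpm)
```

By the lemma, the expected overlap of the estimator (66) can never be smaller than that of the MAP estimator; whether it is strictly larger depends on the chosen `BETA` and `Q`.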

So far, our discussion in this section does not depend on the notion of gauge
invariance. The correspondences between loss functions and optimal estimators shown in
Table 1 are independent of the existence of gauge invariance of the model. With
gauge invariance, we can prove an additional result. Let us assume that the model is
gauge invariant and that the following properties of the loss function $L$ and the estimator
$\hat{x}(y)$

\[ L(U_z(x), U_z(\hat{x})) = L(x,\hat{x}) \tag{67} \]

\[ U_z(\hat{x}(y)) = \hat{x}(V_z(y)) \tag{68} \]

are satisfied for all $z$ (the mappings $V_z$ and $U_z$ are defined in Section 3). Then,
we can show that the expected loss $[L(x,\hat{x}(y))]_{p(y|x)}$ with any fixed $x$ is independent
of the value of $x$. The proof ([12] p.396, [42] p.168) is as follows:

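\begin{align*}
[L(x,\hat{x}(y))]_{p(y|x)} &= \sum_y L(x,\hat{x}(y)) \cdot p(y|x) \\
 &= \sum_y L(x^*, U_z(\hat{x}(y))) \cdot p(V_z(y)|x^*) \\
 &= \sum_y L(x^*, \hat{x}(V_z(y))) \cdot p(V_z(y)|x^*) \\
 &= \sum_y L(x^*, \hat{x}(y)) \cdot p(y|x^*) \\
 &= [L(x^*,\hat{x}(y))]_{p(y|x^*)} \tag{69}
\end{align*}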

where $x^*$ is an arbitrarily chosen "standard" configuration, say a ferromagnetic state,
and $z$ is chosen to satisfy the relation $x^* = U_z(x)$. The result (69) means that the
estimator performs equally well for any value of the original parameter $x$. In terms
of statistics [42], an estimator that is optimal within the class of estimators with
such uniformity is called a minimum risk equivariant estimator (MRE)*. The case
discussed in Rujan [7] and Nishimori [8] corresponds to a special example of an MRE.
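
The uniformity expressed by (69) can be observed numerically. The Python sketch below is an illustration under assumptions that are not taken from the text: a uniform prior on three Ising variables, data consisting of noisy copies of each x_i and of each product x_i x_j, and the Ising gauge transformation acting as x_i -> z_i x_i, y_i -> z_i y_i, J_ij -> z_i z_j J_ij. To avoid ties in a sign function, it uses the squared-error loss with the posterior mean as the estimator, a pair that satisfies (67) and (68) in this model.

```python
import itertools

N = 3
PAIRS = [(0, 1), (1, 2), (0, 2)]
Q_SITE, Q_BOND = 0.4, 0.1   # assumed flip probabilities for site and bond outputs

spins = list(itertools.product([-1, 1], repeat=N))
# Each data point is (y_0, y_1, y_2, J_01, J_12, J_02).
data = list(itertools.product([-1, 1], repeat=N + len(PAIRS)))

def likelihood(d, x):
    """p(d|x): each x_i and each product x_i x_j is flipped independently."""
    p = 1.0
    for i in range(N):
        p *= (1 - Q_SITE) if d[i] == x[i] else Q_SITE
    for k, (i, j) in enumerate(PAIRS):
        p *= (1 - Q_BOND) if d[N + k] == x[i] * x[j] else Q_BOND
    return p

def posterior_mean(d):
    """Posterior mean <x_i> under a uniform prior, optimal for squared error."""
    norm = sum(likelihood(d, x) for x in spins)
    return [sum(x[i] * likelihood(d, x) for x in spins) / norm for i in range(N)]

def risk(x):
    """Expected loss [L(x, xhat(y))]_{p(y|x)} for a fixed original x."""
    total = 0.0
    for d in data:
        est = posterior_mean(d)
        total += likelihood(d, x) * sum((x[i] - est[i]) ** 2 for i in range(N))
    return total

risks = [risk(x) for x in spins]
print(min(risks), max(risks))
```

The smallest and largest of the eight risks coincide up to rounding error, so the expected loss does not depend on which x was sent.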

In fact, we can remove the assumption (68) on the estimator if the estimator
is optimal and the optimal estimator is known to be unique. This means that if the
loss function is gauge invariant, the corresponding optimal estimator is automatically
gauge covariant and satisfies (68)††. The proof is easy if we note that the estimator
defined by

\[ \hat{x}_z(y) = U_z^{-1}(\hat{x}(V_z(y))) \tag{70} \]

has the same performance as the original estimator $\hat{x}(y)$, i.e.,

\[ [[L(x,\hat{x}_z(y))]_{p(y|x)}]_{\pi(x)} = [[L(x,\hat{x}(y))]_{p(y|x)}]_{\pi(x)}. \tag{71} \]

The relation (71) is confirmed by a calculation similar to that in the proof of (69),
under the assumption (67) and the gauge covariance of $p(y|x)$ and $\pi(x)$. Thus, under
the assumption of the uniqueness of the optimal estimator, $\hat{x}_z(y)$ must coincide
with $\hat{x}(y)$ for any value of $z$. This proves the relation (68).
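
The calculation behind (71) can be written out as follows. This is a sketch under two assumptions that are used but not spelled out in this section: that $U_z$ and $V_z$ are one-to-one mappings, and that gauge covariance means $p(y|x) = p(V_z(y)|U_z(x))$, as in the proof of (69), together with $\pi(x) = \pi(U_z(x))$.

```latex
\begin{align*}
[[L(x,\hat{x}_z(y))]_{p(y|x)}]_{\pi(x)}
 &= \sum_x \sum_y \pi(x)\, p(y|x)\, L\bigl(x, U_z^{-1}(\hat{x}(V_z(y)))\bigr) \\
 &= \sum_x \sum_y \pi(x)\, p(y|x)\, L\bigl(U_z(x), \hat{x}(V_z(y))\bigr)
    && \text{by (67)} \\
 &= \sum_x \sum_y \pi(U_z(x))\, p(V_z(y)|U_z(x))\, L\bigl(U_z(x), \hat{x}(V_z(y))\bigr)
    && \text{by gauge covariance} \\
 &= \sum_{x'} \sum_{y'} \pi(x')\, p(y'|x')\, L\bigl(x', \hat{x}(y')\bigr)
    && x' = U_z(x),\ y' = V_z(y) \\
 &= [[L(x,\hat{x}(y))]_{p(y|x)}]_{\pi(x)} .
\end{align*}
```

The second line applies (67) to the pair $x$ and $U_z^{-1}(\hat{x}(V_z(y)))$, and the fourth line only relabels the summation variables. If the optimal estimator is unique, $\hat{x}_z(y) = \hat{x}(y)$ then gives $U_z(\hat{x}(y)) = \hat{x}(V_z(y))$, which is (68).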

* The term "invariant" is also used. The author prefers "covariant", but does not know whether it
has been used by statisticians. Here we restrict ourselves to the special form of $U_z$ and $V_z$ induced
by the gauge transformation group of the Ising spin glass. See the references [12, 42] for definitions and
results with an arbitrary group of transitive transformations.

†† It is not true without an additional assumption on the uniqueness. A counterexample is given by
a binary symmetric channel with extreme noise $q = 1/2$, which transmits no information. For this
example, any estimator is "optimal" for $L(x,\hat{x}) = -\sum_i x_i \hat{x}_i$. Some of them, say an estimator that
returns a constant as its estimate, are evidently not MREs.